ABSTRACT

Item response theoretic methods are applied to the
measurement of achievement of students from various instructional
backgrounds. This extended item response theory (IRT) approach serves
as a tool for studying instructional bias, or instructional
sensitivity. The model maintains the form of an IRT model, but has
parameters that quantify the extent of the effect attributed to
opportunity to learn (OTL). The technique is applied to detect
instructional sensitivity using the Second International Mathematics
Study (SIMS) set of 40 core items for eighth-graders in the United
States. The SIMS data came from about 280 schools and about 7,000
students measured at the end of spring 1982. The achievement test
used contained 180 items in the areas of arithmetic, algebra,
geometry, and measurement distributed among four test forms. In the
SIMS data, there was considerable heterogeneity in the mathematics
instruction experiences of students. The model features parameters
estimating the influence of student background and OTL content
pertinent to each specific test item on a single latent mathematics
ability trait, and the effects of the mathematics ability trait and
the item-specific OTL on the difficulties of test items. The analysis
indicates that certain test items representing early stages of
learning about selected mathematical topics were particularly
sensitive to specific instruction. An 18-item list of references is
included. (SLD)
Introduction



Standardized achievement testing in most American schools today involves a
heterogeneous group of students. One major source of this heterogeneity at a given
grade level is the difference in instructional experiences of students (e.g., McKnight
et al., 1987). It is little wonder that the match between the school curriculum and
what is tested continues to be of concern, e.g., Airasian and Madaus (1983), Haertel
and Calfee (1983), Linn (1983), Schmidt, Porter, Schwille, Floden, and Freeman (1983),
Leinhardt (1983), Leinhardt and Seewald (1981), Mehrens and Phillips (1986), and
Miller (1986).

The research reported here extends our developments of item response
theoretic methods for achievement of heterogeneous groups of students (Muthen,
1987a, b). Within this framework, the present study expands on efforts to
disentangle the influences of ascriptive instructional backgrounds as they impact
estimation of the parameters of the achievement measurement model. The
emphasis here is on how one might model the effects of differences in instructional
backgrounds of students on the resulting achievement latent trait and observed item
difficulties. This work is being reported at a relatively early phase of the inquiry in
order to call attention to what we view to be a potentially fruitful psychometric
method for examining achievement test data obtained from students with varying
instructional backgrounds. It is hoped that presentation of the research at this stage
will stimulate discussion about the applicability of the methodology for research and
practice within the domain of large-scale instructional testing.

Item Response Theory (IRT) is a common tool for the study of item bias.
Under the IRT model, invariance of measurement parameters is assumed to hold for
different subgroups. Deviations from this assumption are viewed as item bias. To
detect bias, the group membership of the examinees is identified and the estimated
curves describing the probability of a correct answer for a given ability level are
compared across groups. A large area between curves is an indication of IRT item
bias.

As suggested by Linn and Harnisch (1981), "instructional bias" may be
mistaken as bias due to ethnicity. Recent studies have changed the traditional focus
on ethnic and gender biases in achievement tests to instructional bias. For instance,
Lehman (1986) studied algebra items for eighth grade students. Gender and
opportunity-to-learn (OTL; Anderson, 1988) in the classroom were used as grouping
variables. Relative to gender, OTL was found to be a much more important cause of
item bias. Miller and Linn (1986) used an alternative approach to the study of
instructional bias. Based on OTL and item content, cluster analysis was carried out to
create curriculum clusters. When comparing item response curves for the same item
across clusters, they found strong evidence of instructional bias. The magnitude of
the instructional bias was claimed to be larger than that usually found with different
ethnic groups.

The Lehman and Miller-Linn approaches build on grouping test-takers. The
grouping may depend on the sample distribution. There is also the drawback of
basing the estimation of an item's parameters in a certain group (cluster) on students
that may well have a wide range of OTL. Different group criteria may lead to
different conclusions.

Standard IRT techniques assume that instruction increases the item
performance through an increase in the latent trait level, while the item-trait
relationship remains the same. This assumption is usually too strong for groups of
students with widely different content coverage. Certain classes may have obtained
more extensive instruction for specific content areas so that the performance on
the corresponding item types is relatively better than on the majority of the items
for the average student. This is the cause of instructional item bias. Muthen (1987a)
pointed out the psychometric problem of traditional IRT-based item bias detection
schemes, showing a misestimation of bias in the plausible situation of many items
showing instructional bias. Muthen's extended IRT model may serve as a better tool
for studying the instructional bias, or, as we will term it, instructional sensitivity. His
model maintains the form of an IRT model, but in addition has parameters which
quantify the extent of the effect attributed to OTL. Using similar modeling, Muthen
(1987b) also considers other educational and social student background information
as predictors of item response. As Mislevy (1987) indicated, "what IRT models miss
are these systematic differences among examinees performing at the same general
level" (pp. 261-262). The assumptions of IRT which preclude the influences from
auxiliary variables are challenged and examined in Muthen's model.

Muthen's model may be briefly described as follows. Building on the
statistical theory of Muthen (1984), Muthen (1987b) proposed a new extension of
IRT modeling that controls for student background differences by including
background variables as covariates. Further extending this methodology, Muthen
(1987a) proposed a method for explicitly including item-specific information on
instructional differences, allowing for OTL effects on performance not only through
an increase in trait level, but also directly. This model parameterization essentially
allows for several difficulty levels for each item corresponding to different
instructional classifications. In this way, the deficiency of traditional IRT bias
detection techniques is avoided. The instructional heterogeneity of the students is
taken into account and any differential instructional effects on the item difficulty
parameters can be directly estimated.

The Muthen (1987a) technique for detecting instructionally sensitive items
was illustrated with a very small set of 8 algebra items from the U.S. sample of eighth
graders in the Second International Mathematics Study (SIMS), Crosswhite, Dossey,
Swafford, McKnight, and Cooney (1985). The aim of this paper is to apply the
technique to detect instructional sensitivity in a more realistic setting, using the
SIMS set of 40 core items for U.S. eighth graders. This set contains items covering
algebra, arithmetic, geometry, and measurement. By this analysis, it is hoped that
types of items that are particularly susceptible to instructional sensitivity in this
context can be discerned. Such items may be less suitable in activities of broad
assessment of more stable traits, but may be of primary interest for achievement
assessment. The achievement measurement process can be improved by better
understanding the link between item types and instruction in this way.
Furthermore, item analysis by standard IRT techniques would ignore instructionally
sensitive items and result in biased estimates of measurement parameters.



The Data

In brief, the SIMS data features are as follows. A national probability sample
of school districts was selected proportional to size; a probability sample of schools
was selected proportional to size within school district; and two classes were
randomly selected within each school, yielding a total of about 280 schools and about
7,000 students measured at the end of spring 1982. The achievement test
contained 180 items in the areas of arithmetic, algebra, geometry, and measurement
distributed among four test forms. Each student responded to a core test (40 items)
and one of four randomly assigned rotated forms (34 or 35 items). All items were
presented in a five-category multiple choice format.



In the analysis that follows, a key piece of instructional information was
obtained as follows. For each item, teachers were asked two questions regarding
student opportunity to learn.

Question 1:

"During this school year did you teach or review the mathematics needed to
answer the item correctly?"

1. No

2. Yes

9. No response

Question 2:

"If in this school year you did not teach or review the mathematics needed
to answer this item correctly, was it mainly because?"

1. It had been taught prior to this school year

2. It will be taught later (this year or later)

3. It is not in the school curriculum at all

4. For other reasons

9. No response

Given these responses, opportunity-to-learn (OTL) level will be classified as
follows:

OTL: Question 1 = 2, question 2 = 9

or question 1 = 1 or 9, and question 2 = 1

No OTL (NTL): Question 1 = 1, question 2 = 2, 3, 4, or 9

or question 1 = 9, question 2 = 2, 3, or 4

Other response combinations lead to the elimination of the observation.
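
The classification rule above can be written as a small function. This is only a sketch: the function name, the return labels, and the use of None for an eliminated observation are illustrative choices, and code 9 is taken to mean "no response" on both questions, as the rule itself implies.

```python
from typing import Optional

NO_RESPONSE = 9  # assumed "no response" code for both questions


def classify_otl(q1: int, q2: int) -> Optional[str]:
    """Return "OTL", "NTL", or None when the observation is eliminated."""
    # Taught or reviewed this year; the follow-up question was left blank.
    if q1 == 2 and q2 == NO_RESPONSE:
        return "OTL"
    # Not taught this year (or question 1 unanswered), but taught in a prior year.
    if q1 in (1, NO_RESPONSE) and q2 == 1:
        return "OTL"
    # Not taught this year, and no indication of teaching in a prior year.
    if q1 == 1 and q2 in (2, 3, 4, NO_RESPONSE):
        return "NTL"
    # Question 1 unanswered, but a reason other than prior teaching was given.
    if q1 == NO_RESPONSE and q2 in (2, 3, 4):
        return "NTL"
    # Every other combination is dropped from the analysis.
    return None
```

For example, a teacher who answers "Yes" to question 1 and also gives a reason under question 2 falls outside all four cases, so that observation would be eliminated.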

The percentage distribution of OTL categories for all 40 items is given in
the left-most part of Table 1 (see Appendix A) together with proportion correct.

It seems that the percentage of students having no OTL (category NTL)
varies greatly across the items. With the exception of 5 items, having had OTL is
most common. However, about 1/3 of the items show NTL proportions larger than
0.33. It is also seen that the proportion correct varies greatly over the different OTL
categories. These are clear indications of the student heterogeneity.

The use of the dichotomously scored, teacher-reported OTL in our model is
noteworthy. Mehrens and Phillips (1986) used textbook series and school
personnel ratings to study the influence of the match between what was taught and
what was tested for reading and math scores in grades 3 and 6. As Leinhardt and
Seewald (1981) pointed out, the two most common approaches to the measurement
of overlap between what is tested and what is taught are instructional-based and
curriculum-based measurement.

In the SIMS, student-reported item-specific OTL is also available. Both
teacher- and student-reported OTL are presumably fraught with error. Teachers'
reporting may not be relevant for a student who was absent from or did not
understand the instruction. Students' reporting may partly reflect their
perception of the item difficulty. The two ways of reporting are not highly
correlated (Lehman, 1986). We feel that the teacher-reported OTL is more
trustworthy.

In preliminary analysis, we considered using three-category OTL
measurements corresponding to OTL this year, OTL prior year(s), and no OTL.
However, this approach was abandoned in favor of using dichotomous OTL for the
following conceptual and technical reasons. First of all, the prior year effect may be
hard to estimate since prior year OTL is not distinctly defined, but may refer to OTL
more than a year ago as well as OTL late in the previous year. Second, many items
showed low percentages for the prior year OTL category, leading to unstable
estimates. Third, use of the three-category OTL variables led to high correlations
between several items' prior year and this year OTL measurements, resulting in
multicollinearity among the predictors.

Preliminary analyses also found probable misreporting by a teacher. For two
items, the no OTL category was made up of 24 students from one class who all got
the items right. Plotting the sum of correct answers versus the sum of the
dichotomously scored OTL, this class was found to be a distinct outlier with very high
performance and rather low OTL. For these two reasons, this class was deleted from
the analyses to be presented.

In addition to the above item-specific OTL information, instructional
background information common to all items is available in the SIMS in the form
of a classification of each mathematics class into one of four types: basic or remedial
arithmetic (REMEDIAL), general or typical mathematics (TYPICAL), pre-algebra or
enriched (ENRICHED), and algebra (ALGEBRA). This classification is based on
teacher questionnaire data and on information on textbooks used.

In the SIMS data there is also available a set of background variables for each
student measured during the fall of eighth grade. These variables include pretest
measurements of mathematics, family background, educational aspiration, attitudes
toward mathematics, gender, and ethnicity; see Table 2 and also Muthen (1987b).

The premeasurements were only collected for part of the sample. The
analysis considers a total of 3,724 students who had complete observations
on both fall and spring measurements in this set. This analysis sample involves 198
classes.



The Model 

Following Muthen (1987a), detection of instructionally sensitive items
among the items is achieved by estimation of the following model. A diagrammatic
representation of the model is given in Figure 1 (see Appendix A). The model will
first be described in words and then statistically.

The mathematics trait in the spring of eighth grade is an unobserved
continuous variable that is measured by, or in other words, predicts, the set of test
items. This trait will alternatively be called math ability or achievement level,
although a more careful distinction is no doubt desirable when discussing a trait for
students with varying OTL. Muthen (1987a) suggests the term "latent performance
level." We want to study the effect of OTL on the item performance since it is
possible that having OTL enhances the specific skills needed to solve the
corresponding item correctly. Adding these variables as predictors, the modeling has
to recognize that math ability in the spring is an endogenous variable relative to the
OTL variables. The OTL variables predict the item performance but also determine a
part of the math ability level itself. To correctly model the prediction of spring
math ability, it then becomes necessary to specify a more comprehensive set of
predictors for math ability, where OTL influence on math ability is specified as partial
effects, holding other background variables constant.

Spring math ability is here taken to be predicted by fall pretests, attitudes,
family background, demographics, class type, and OTL. These predictors influence
the math ability variable and thus, indirectly, also the performance on the test
items. The majority of the background variables are assumed to only correlate
because of their common influence on the math ability variable.

The OTL variables, however, are also allowed to influence the corresponding
test items directly, although not all items are expected to have such effects. Any
such effect would be an influence of OTL over and above that which is transferred
via the math ability. Hence, the probability of a correct response for students with
different OTL would be different even if they have the same math ability. This
effect implies item bias due to instructional sensitivity in the item at hand. This can
be stated as OTL not influencing math ability homogeneously across the set of test
items. It is interesting to note that bias due to instructional sensitivity in the items
is assessed here without resorting to traditional item bias detection schemes which
necessitate a classification of students into groups with different OTL values. The
present analysis avoids the arbitrariness of such groupings in a situation where group
membership obviously varies across items. The model also presents a wealth of
other relevant information on the achievement process.

More technically, the model may be presented as follows. An IRT model is
specified for measuring the trait by the set of items. In this analysis a two-parameter
normal ogive response curve model is chosen for this measurement part (e.g., Lord,
1980). Let us consider the influence of the item-specific OTL variables, z say, and
the student background variables, x say (premeasurements, attitudes, demographics,
and class type). In our analysis we will create an OTL dummy variable for each item
j, where $z_j = 1$ represents OTL. The trait variable is $\eta$, say. We specify the linear
regression model,

(1) $\eta = \gamma_x' x + \gamma_z' z + \zeta$

where x and z are vectors of variables and $\zeta$ is a normally distributed residual with
zero mean and variance $\psi$, and where $\zeta$ is independent of x and z.

In addition to the part predicting $\eta$, we specify an influence from the z
variable for a certain item to the response for that particular item. While each
item's z variable influences the item response through the $\eta$ variable, this part of
the model concerns the direct influence from the z to the item, over and above that
which goes through $\eta$. It is convenient to express the direct influence of the z
variables on the items using a latent response variable formulation, where

(2) $y_j = 0$ if $y_j^* < \tau_j$, and $y_j = 1$ otherwise,

where $\tau_j$ is a threshold parameter defined on the continuous latent response
variable

(3) $y_j^* = \lambda_j \eta + \beta_j z_j + \epsilon_j$

The latent response variable may be viewed as the specific skill needed to solve the
corresponding item correctly; when the latent response variable exceeds a
threshold, the item is correctly answered. We assume that $\epsilon_j$ is a residual with mean
zero that is independent of $\eta$ and the z's. By adding the assumption that $\epsilon_j$ has a
normal distribution, the standard normal ogive model of IRT is obtained, except that
OTL is allowed to have direct influence on the item.
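
To make the direct OTL effect concrete, the following sketch evaluates the probability of a correct answer for a given trait value with and without OTL under the latent response formulation just described. The function and argument names are illustrative, the parameter values are hypothetical rather than estimates from this study, and the residual variance is passed in as an argument (defaulting to 1) because its standardization is a separate modeling choice.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_correct(eta, z, slope, direct_otl, threshold, resid_var=1.0):
    """P(latent response exceeds the threshold | trait value eta, OTL dummy z).

    The latent response is slope * eta + direct_otl * z + residual, with a
    normal residual of mean zero and variance resid_var.
    """
    mean = slope * eta + direct_otl * z
    return normal_cdf((mean - threshold) / math.sqrt(resid_var))


# Hypothetical item: same trait value, different OTL status.
p_without_otl = prob_correct(eta=0.0, z=0, slope=0.5, direct_otl=0.4, threshold=0.2)
p_with_otl = prob_correct(eta=0.0, z=1, slope=0.5, direct_otl=0.4, threshold=0.2)
```

With a positive direct effect, `p_with_otl` exceeds `p_without_otl` even though the trait value is identical, which is what instructional sensitivity means in this model. Setting `direct_otl` to zero makes the two probabilities coincide, recovering the standard normal ogive model.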

In effect this specification allows items to have different difficulty for
different OTL levels (cf. Muthen, 1987a). The shift in difficulty is provided by the $\beta$
parameter. The parameters of this model may be translated to those of standard IRT,
so that each item obtains one discrimination parameter value and, in the present
case of 2 OTL categories, 2 difficulty parameter values. The formulas for the
translation are as follows. The conditional variance of $y_j^*$ given the x and z variables
is standardized to 1, resulting in a residual ($\epsilon$) variance $\theta_{jj} = 1 - \lambda_j^2 \psi$. Let the mean
and variance of $\eta$ be denoted $\mu_\eta$ and $\sigma_{\eta\eta}$, respectively. It can then be shown that
the two-parameter normal ogive parameters a (discrimination) and b (difficulty) for
item j can be written as

(4) $a_j = \lambda_j \theta_{jj}^{-1/2} \sigma_{\eta\eta}^{1/2}$,

(5) $b_{j z_j} = [(\tau_j - \beta_j z_j) \lambda_j^{-1} - \mu_\eta] \sigma_{\eta\eta}^{-1/2}$.

In these formulas, the trait has been standardized to mean zero and variance one.
The estimated values of a and b may be obtained by inserting model parameter
estimates in (4) and (5), where the sample means, variances, and covariances for the
x's and the z's are also used to compute the estimated $\mu_\eta$ and $\sigma_{\eta\eta}$. For each item we
can then obtain 2 estimated item characteristic curves and compute differences
between these curves. In this paper we will choose to use the simple index (called
D) discussed by Linn, Levine, Hastings, and Wardrop (1981), where squared probability
differences are added up over the trait range -3 to +3.
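
The translation in (4) and (5) and the curve comparison can be sketched as follows. All names are illustrative and the parameter values are hypothetical. The text does not spell out the grid or the normalization of D, so the step of 0.01, the midpoint grid, and the final square root are assumptions of this sketch rather than the procedure actually used.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def irt_parameters(lam, tau, beta, psi, mu_eta, var_eta):
    """Apply (4) and (5): return (a, b_no_otl, b_otl) for one item."""
    # Residual variance when the conditional variance of y* is fixed at 1.
    theta = 1.0 - lam ** 2 * psi
    a = lam * math.sqrt(var_eta / theta)

    def b(z):
        return ((tau - beta * z) / lam - mu_eta) / math.sqrt(var_eta)

    return a, b(0), b(1)


def curve_distance(a, b_no_otl, b_otl, lo=-3.0, hi=3.0, step=0.01):
    """Root of the summed squared differences between the two curves (assumed form of D)."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * step  # midpoint of each grid cell
        diff = normal_cdf(a * (t - b_otl)) - normal_cdf(a * (t - b_no_otl))
        total += diff * diff * step
    return math.sqrt(total)


# Hypothetical parameter values, not estimates from the paper.
a, b0, b1 = irt_parameters(lam=0.5, tau=0.2, beta=0.4, psi=1.0, mu_eta=0.0, var_eta=1.0)
d = curve_distance(a, b0, b1)
```

A positive direct OTL effect lowers the difficulty for the OTL group (`b1 < b0`) while leaving the discrimination `a` unchanged, so the two curves are horizontal shifts of each other and the distance grows with the size of the shift.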

Inserting (1) in the expression for the latent response variable gives the
so-called reduced form for the regression of the y*'s on the x's and z's. These are
probit regressions, where the model imposes restrictions on the probit slopes and
residual correlations. The slopes are expressed by the $\gamma$, $\beta$, and $\lambda$ parameters of the
model, while the residual correlations also involve the remaining parameter $\psi$ for
the residual variance. The parameters may be estimated by fitting the model to the
probit regression slopes and correlations.
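
The restrictions that the model places on the reduced-form slopes and residual correlations can be illustrated with a short sketch. It assumes one OTL dummy per item, item residuals that are uncorrelated across items, and latent responses standardized to conditional variance 1, so that residual covariances are also correlations. The function name and all input values are hypothetical.

```python
def reduced_form(lam, beta, gamma_x, gamma_z, psi):
    """Model-implied probit slopes and residual correlations.

    lam, beta: per-item trait slopes and direct OTL effects (length J).
    gamma_x:   effects of the background variables on the trait.
    gamma_z:   effects of the J item-specific OTL dummies on the trait.
    psi:       residual variance of the trait.
    """
    n_items = len(lam)
    # Background variables reach item j only through the trait.
    slopes_x = [[lam[j] * g for g in gamma_x] for j in range(n_items)]
    # OTL for item m reaches item j through the trait, plus directly when m == j.
    slopes_z = [
        [lam[j] * gamma_z[m] + (beta[j] if m == j else 0.0) for m in range(n_items)]
        for j in range(n_items)
    ]
    # Two items share residual variation only through the trait residual.
    resid_corr = [
        [1.0 if j == k else lam[j] * lam[k] * psi for k in range(n_items)]
        for j in range(n_items)
    ]
    return slopes_x, slopes_z, resid_corr


# Hypothetical two-item example.
sx, sz, rc = reduced_form(
    lam=[0.5, 0.8], beta=[0.3, 0.0], gamma_x=[1.0, -0.5], gamma_z=[0.1, 0.2], psi=0.5
)
```

In this example the direct effect shows up only in the slope of item 1 on its own OTL dummy, which is how an instructionally sensitive item departs from the pattern implied by the trait alone.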

Muthen (1987c) describes the LISCOMP computer program which builds on
theory in Muthen (1984) and encompasses the present type of model. The
technical details of our analysis will not be discussed here. The slopes and the
correlations correspond to different model parts in the LISCOMP framework and can
be analyzed together or separately. In the present case there are 40 y variables
(items) and a total of about 50 x and z variables. This yields potentially 2,000 slopes
and 780 correlations to fit the model to. Each of these regressions involves a
nonlinear maximum likelihood probit regression with 50 x variables on 3,724
observations and is therefore computationally burdensome. In order to ease the
computational burden, only the slope part of the LISCOMP framework will be used to
estimate the $\gamma$'s, $\beta$'s, and $\lambda$'s. The $\psi$ parameter will be estimated using both the slope
and correlation part through a separate analysis on a subset of about half of the items
showing particularly good measurement qualities. To further simplify computations, the
fitting of the model of Figure 1 is carried out by unweighted least-squares. Still, the
computations are heavy in that they involve the estimation of over 100 $\lambda$'s, $\beta$'s, and
$\tau$'s. While the unweighted least-squares estimator does not provide standard errors
of estimates, follow-up analyses on subsets of items by generalized least-squares will
give indication of the magnitude of estimates needed for statistically meaningful
values.

Analysis Results 

Preliminary analyses were performed by standard IRT techniques. Using the
two-parameter logistic model and marginal maximum likelihood estimation provided
by the BILOG program (Mislevy & Bock, 1984), it was revealed that item 10 was very
hard and had deficient measurement properties. The subsequent analyses were
performed with only 39 test items. An item factor analysis strongly supported the
notion of unidimensionality for this set of items. A scree plot of the latent roots for
the tetrachoric correlation matrix is given in Figure 2. Note that this is only a rough
assessment of the dimensionality of the items since the items may correlate not only
due to the trait but also due to the OTL influence.

The estimation of the influence of the background variables on the ability
will be discussed first. Next, the estimates of the measurement parameters relating
the item responses to the ability will be presented. Finally, we will turn to the
estimates of primary concern in this paper, namely those representing the effect of
instructional sensitivity.

Relating the Ability to Background Variables

The estimates from the regression of the trait on the background variables
are given in Table 3. Although standard errors of estimates are not provided for this
model, generalized least-squares estimation on a subset of items indicates that
estimates larger than [...] The pretest variable related to arithmetic dominates the
prediction of spring math ability. This is natural since this is the area of mathematics
best covered up to eighth grade and since performance on these kinds of tasks influences
the selection of students into more advanced math classes where they get further
training that enhances their ability. One may note that the prearithmetic variable
correlates 0.76 with the posttest sum of correct answers. Among non-pretest
variables, finding mathematics useful is the most important one.

The $\gamma$-parameter estimates for the effect of OTL variables on the math ability
will not be presented here. Overall, the effects are negligible. The prediction of
math ability by fall measurements is quite successful in that the estimated portion of
variation in math ability explained by the various background variables is 76%.

When using the SIMS data to illustrate the approach to assessing instructional
item sensitivity, Muthen (1987a) only included the OTL variables and not the other
background variables used here. Given our present model, omitting these other
background variables would lead to biased estimates of the item parameters and their
instructional sensitivity. However, we have found that such biases are small for the
data, probably due to the rather small correlations between the OTL variables and
the other background variables. This is a useful finding for situations where pretests,
or other early performance measures, are not available.

Relating the Items to the Ability 

The measurement of the trait $\eta$ is reflected in the $\lambda$ parameters representing
the slopes (factor loadings) in the regressions of the latent response variables y* on
the trait $\eta$. The estimates of these are given in Table 4, which also contains the
estimated values of the threshold $\tau$ and of the corresponding IRT parameters, one a
and two b's for each item calculated as in (4) and (5). Table 4 also contains the
corresponding estimates of IRT parameters a and b as obtained by standard analysis,
here carried out by marginal maximum likelihood in the BILOG program (Mislevy &
Bock, 1984). The wording of each of the 40 items is given in Appendix B.

Table 4 shows that items 3, 6, 7, 17, 19, 21, and 39 have $\lambda$-values less than or
equal to .45 and are not good measurements of the math ability trait. It is
interesting to note that six of these seven items have geometric or spatial content.
With the exception of item 17, all these items had NTL values of at least .25.







It is also interesting to note that standard IRT estimation of a and b
parameters as compared to our approach gives results that are rather similar for a but
quite different for b. Two explanations may be offered for this. One is that our
results come from a model that extends the standard IRT to background variables,
giving a fuller description of the trait where it is determined not only by item
performance but also by predictors thereof. In statistical terms the model is strong
in that the notion of unidimensionality is extended to not only explain item
interrelations but also relations between items and predictors. While this is largely a
matter of using more information for estimation, the second reason relates to bias in
the standard IRT estimation due to use of the wrong model. Under a model that
allows for direct OTL influence on the items, the use of a standard IRT model ignores
both student heterogeneity in the item parameters and that, in addition to the trait,
the OTL influence also causes dependency among the items.

Instructional Sensitivity 

Of greatest interest in this paper are the estimated $\beta$ parameters
representing the direct effects of OTL on the item performance, thereby indicating
instructional sensitivity in the items. The estimated $\beta$'s and the corresponding
measures of distance between the probability curves (item characteristic curves) are
given in the rightmost part of Table 1. The implications of the estimates in this part
of the table are best understood by a discussion of the items that show substantial
instructional sensitivity.

Consider first item 17. This is a geometry item which for its correct solution
requires knowledge of the definition of an acute angle. From Table 1 we see that
13% have had no OTL for this item, of whom 38% get the item right, while 62% get
it right with OTL (this year or prior years). The $\beta$ estimate for OTL is positive,
reflecting the extra advantage, over and above what the trait level would predict, of
having OTL versus not having OTL. Note that while the proportion correct for an
item is an estimate of a marginal probability, the $\beta$ effect concerns the probability
conditional on the trait and is therefore the appropriate measure of instructional
sensitivity. Several items have large differences in proportion correct for OTL versus
no OTL while having negligible $\beta$ effects. In order to gauge the importance of the
corresponding shifts in the conditional probabilities, Figure 3 shows the standardized
probability curves over the trait range -3 to +3.

For an average trait value of 0, the extra advantage of OTL is estimated as an
approximate increase of 0.15 for the probability of a correct answer. The
corresponding curve distance (D value) is .32.

A working hypothesis for a particularly strong reason for instructional
sensitivity is that the item is definitional in nature and represents early learning on
the topic of angles. It is therefore rather hard for students who have not been
exposed to it, while rather easy once exposed to it. A harder item may show less
instructional sensitivity since even with OTL many students may get it wrong. An
item such as number 17 may be less valuable as an indicator of a more general trait
than as an indicator of exposure in a certain limited area. From Table 4 we note that
item 17 is among the group of items that we identified as having rather poor
measurement qualities, with an estimated $\lambda$ value of .43 (an estimated a value of
.45).

Consider next item 39. As is seen in Appendix B, this item refers to
knowledge about the coordinate system. Here, a rather large group of 30% have no
OTL. In terms of proportion correct, the item seems rather easy for the OTL
category (0.63) while rather hard for no OTL (0.33). There is a substantial difference
between the estimated probability curves for OTL versus no OTL (0.35). Like item
17, this instructional sensitivity in item 39 seems to correspond to definitional
learning such that the item becomes quite easy when the student is exposed to this
knowledge. The item is also a poor indicator of the trait (see Table 4) and is in fact
the worst one. As shown in the estimated probability curves of Figure 5, the
discrimination (the slope) is very small. This would mean that getting the item
correct involves little of general math ability, but merely indicates the specific
knowledge of the definition, a plausible explanation for this item.

Other items also show substantial instructional sensitivity and may further
support the hypothesis of introductory definitional content. To solve item 38, a
student needs to know the definition of percentage, followed by a straightforward
arithmetic operation. Item 16 calls for knowledge about multiplying negative
integers and parentheses, and item 3 deals with the simplification and solution of a
routine algebra equation. But unlike items 17 and 39 discussed above, items 38 and
16 provide good measurements of the trait.

The proposed methodology represents a new way to study the instructional
sensitivity of achievement items. Given sufficiently rich data, instructionally
sensitive items can be detected while at the same time gaining information about
the achievement process through the estimation of a comprehensive model that
goes well beyond those of standard IRT analytical methods for examining
achievement test data.

The exact nature of the benefits to be gained from estimating the effects of 
instructional opportunities on both the latent ability and the item difficulties 
depends on the specific empirical context in which the methodology is employed. 
Naturally, the heterogeneity of the pool of achievement items and of the student 
population tested matters. What also matters is the adequacy of the specification of 
the model of achievement and of the measurement of instructional opportunities and 
other characteristics. 

In the present case, there was considerable heterogeneity in the 
mathematics instruction experiences of students; some students were still enrolled 
in remedial instruction dominated by arithmetic operations with integers and 
common and decimal fractions when others were enrolled in elementary algebra 
classes. The set of test items broadly spanned topics typically covered by the end of 
elementary algebra instruction. Against this backdrop, the model examined here 
featured parameters estimating the influence of student background and 
opportunities to learn content pertinent to each specific test item on a single latent 
mathematics ability trait and the effects of the mathematics ability trait and the 
item-specific OTL on the difficulties of test items. 
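
The structure just described can be written compactly as follows. This is only a 
sketch: the symbols are assumptions introduced here for illustration and may differ 
from the notation used in the model section of the paper. 

```latex
% Assumed notation (illustrative only):
%   \eta_i      latent mathematics ability of student i
%   \mathbf{b}_i  vector of student background variables
%   \mathbf{x}_i  vector of item-specific OTL indicators, with element x_{ij} for item j
%   y^*_{ij}    latent response underlying the observed 0/1 response y_{ij}
\begin{align}
  \eta_i   &= \boldsymbol{\gamma}_b' \mathbf{b}_i
            + \boldsymbol{\gamma}_x' \mathbf{x}_i + \zeta_i , \\
  y^*_{ij} &= \lambda_j \eta_i + \beta_j x_{ij} + \epsilon_{ij} ,
  \qquad y_{ij} = 1 \ \text{if } y^*_{ij} > \tau_j , \ \text{else } 0 .
\end{align}
```

In this sketch, $\boldsymbol{\gamma}_x$ carries the effect of OTL on the latent trait, 
while a nonzero $\beta_j$ is a direct effect of OTL on item $j$ that shifts its 
effective difficulty and marks the item as instructionally sensitive. 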

Under these modeling conditions, item-specific OTL had limited impact on 
the latent variable representing mathematics ability once student background 
variables (which included prior mathematics performance) were controlled. 
However, for selected test items, there were strong direct effects of item-specific 
OTL over and above the effects of latent mathematics ability. In other words, the 
general, presumably more stable achievement trait was insufficient to account for 
performance on these items. Under standard IRT analysis methods, either the IRT 
results would be biased by the inclusion of such items or the items would have been 
eliminated to avoid violation of IRT assumptions. Neither prospect is attractive. 

Clearly the present analysis provides a more detailed way to examine the 
influence of instruction on responses to test items, a matter of considerable interest 
in developing achievement tests and interpreting test results. In the present case, 
certain test items representing early stages of learning about selected mathematical 
topics were particularly sensitive to specific instruction. Individual differences 
represented within the single latent mathematics ability did not adequately account 
for performance differences on these items. 

What next steps to take in response to the identification of instructionally 
sensitive items is unclear. An obvious possibility here is to consider employing a 
multidimensional latent achievement model to represent the domain of test items. 
Incorporating specific latent factors representing instructionally important 
curriculum segments within the psychometric model is both theoretically and 
practically desirable. Presumably, differential instructional exposure should then 
influence the specific factors. Under such conditions any residual direct effects of 
OTL on item performance represent teaching to the specifics of the test, a typically 
undesirable instructional strategy. We are currently exploring the possibility of 
applying models with multidimensional latent achievement traits to the SIMS data 
base. 

Given psychometric methodology that can better tie test item performance 
to both ability and instruction, the proper measurement and measurement modeling 
of instruction is highlighted. The above analyses utilized a class-level, and rather 
crude, OTL variable reported by the teacher. It is recognized that the mixture of 
student-level responses and class-level OTL information creates multilevel, or 
hierarchical, observations, a problem which we were forced to ignore in our analyses. 
With few classes in an OTL category, measurement error in the teacher-reported 
OTL may have strong biasing effects. The class-level information may also be 
incorrect for a given student. Student-level OTL is available, but may contain even 
more measurement error. Further substantive research needs to find ways to 
properly combine information of several kinds in order to provide more reliable and 
informative instructional student background. 